Prosper Marketplace is America’s first peer-to-peer lending marketplace, with over $7 billion in funded loans. Borrowers request personal loans on Prosper and investors (individual or institutional) can fund anywhere from $2,000 to $35,000 per loan request. Investors can consider borrowers’ credit scores, ratings, and histories and the category of the loan. Prosper handles the servicing of the loan and collects and distributes borrower payments and interest back to the loan investors.
Prosper verifies borrowers’ identities and select personal data before funding loans and manages all stages of loan servicing. Prosper’s unsecured personal loans are fully amortized over a period of three or five years, with no pre-payment penalties. Prosper generates revenue by collecting a one-time fee on funded loans from borrowers and assessing an annual loan servicing fee to investors.
The dataset is provided by Udacity as part of the project work for its Nanodegree program. We will explore the dataset using the R language. Let’s look at the structure of the dataset first.
## Dimensions :
## Rows: 113937 Columns: 81
The dataset has 113,937 records and 81 columns. Exploring all 81 columns would be a tremendous task, so we will limit our analysis to the 20 columns listed in the structure output below.
Before we begin our analysis we need to make some changes to our dataset. We will first filter our dataset, so that it will have just the columns we need.
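A minimal sketch of this filtering step, using a small made-up data frame in place of the full 81-column file. The object names `loans`, `keep_cols` and `df` are assumptions for illustration, and only three of the 20 columns are shown.

```r
# Toy stand-in for the full loan data (values are made up).
loans <- data.frame(
  ListingKey         = c("a1", "b2", "c3"),
  Term               = c(36L, 36L, 60L),
  BorrowerAPR        = c(0.165, 0.120, 0.283),
  LoanOriginalAmount = c(9425L, 10000L, 3001L),
  MemberKey          = c("m1", "m2", "m3")
)

# Columns to keep; the real analysis keeps 20 of them.
keep_cols <- c("Term", "BorrowerAPR", "LoanOriginalAmount")

# Fail early if a requested column is missing, then subset by name.
stopifnot(all(keep_cols %in% names(loans)))
df <- loans[, keep_cols, drop = FALSE]

str(df)
```

Selecting by name rather than by position keeps the step robust if the column order of the source file changes.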
## 'data.frame': 113937 obs. of 20 variables:
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 7 levels "A","AA","B","C",..: NA 1 NA 1 5 3 6 4 2 2 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric.: int 0 2 0 16 2 1 1 2 7 7 ...
## $ Occupation : Factor w/ 67 levels "Accountant/CPA",..: 36 42 36 51 20 42 49 28 23 23 ...
## $ EmploymentStatus : Factor w/ 8 levels "Employed","Full-time",..: 8 1 3 1 1 1 1 1 1 1 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
From the structure of the dataset, we can observe
We will convert Term from months to years and store it in a new column, Term_Yrs. The Listing Category description will be stored in the ListingCategory..alpha. column.
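One way these two derived columns could be built is sketched here on toy data. Storing Term_Yrs as a factor is an assumption, and the lookup table is a hypothetical, partial mapping based on our reading of the Prosper data dictionary, not the full list of codes.

```r
# Toy data: Term in months and a numeric listing category code.
df <- data.frame(
  Term = c(36L, 36L, 60L, 12L),
  ListingCategory..numeric. = c(0L, 2L, 1L, 7L)
)

# Months to years, stored as a factor with levels "1", "3", "5".
df$Term_Yrs <- factor(df$Term / 12)

# Hypothetical partial lookup from code to description.
category_labels <- c(
  "0" = "Not Available",
  "1" = "Debt Consolidation",
  "2" = "Home Improvement",
  "7" = "Other"
)
df$ListingCategory..alpha. <- factor(
  unname(category_labels[as.character(df$ListingCategory..numeric.)])
)

# Labels versus internal level codes of the same factor.
as.character(df$Term_Yrs)  # "3" "3" "5" "1"
as.integer(df$Term_Yrs)    # 2 2 3 1
```

The last two lines show why printing a factor's integer codes can be misleading: the code 2 stands for the second level, which is a 3-year term.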
## Term before change:
## 36 36 36 36 36 60
## Term after change:
## 2 2 2 2 2 3
## Unique values in Listing Category:
## 0 2 16 1 7 13 6 15 20 19 3 18 8 4 11 14 5 9 17 10 12
## Variables in data frame:
##
## [1] "Term" "LoanStatus"
## [3] "BorrowerAPR" "BorrowerRate"
## [5] "EstimatedReturn" "ProsperRating..numeric."
## [7] "ProsperRating..Alpha." "ProsperScore"
## [9] "ListingCategory..numeric." "Occupation"
## [11] "EmploymentStatus" "CreditScoreRangeLower"
## [13] "CreditScoreRangeUpper" "BankcardUtilization"
## [15] "AvailableBankcardCredit" "DebtToIncomeRatio"
## [17] "IncomeRange" "StatedMonthlyIncome"
## [19] "LoanOriginalAmount" "Investors"
## [21] "Term_Yrs" "ListingCategory..alpha."
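We can see that the new columns are now present in our dataset. Note that the “after change” values 2 and 3 look like the internal level codes of a factor with levels 1, 3 and 5 rather than years: a 36-month loan is a 3-year loan and a 60-month loan is a 5-year loan, which matches the terms in the summary below. Let’s explore.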
## Summary by Term:
## 1 3 5
## 1614 87778 24545
This shows that 3-year loans are by far the most common, followed by 5-year loans. 1-year loans are rare.
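To put the counts in proportion, the short calculation below converts them to percentages. The counts are copied from the summary above; the variable names are ours.

```r
# Loan counts by term in years, copied from the summary above.
term_counts <- c("1" = 1614, "3" = 87778, "5" = 24545)

# Share of each term as a percentage of all loans.
term_share <- round(100 * term_counts / sum(term_counts), 1)
print(term_share)
```

This gives roughly 1.4% for 1-year loans, 77.0% for 3-year loans and 21.5% for 5-year loans.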
The plot shows that current (live) loans outnumber every other status. About 38,000 loans have been completed, which is encouraging: these loans were paid back without defaulting. The number of loans in the Charged-off category is also large. What is a charge-off?
Charged-Off: If your loan becomes more than 120 days past due (you missed the last 5 monthly payments), your loan will be considered “charged-off.” This does not mean your loan has been excused or forgiven. You are still obligated to make payments. When a loan is charged off, the entire balance is accelerated, meaning it is collectible in full as of the charge-off date.
-Prosper
The high number of charged-off loans, though, paints a less favorable picture. Let us check how Loan Status is distributed with respect to Term.
The short-term (1-year) loans perform somewhat better: most of them are completed or in the final payment stage, and far fewer are delinquent than for the other terms.
Prosper Rating: It is a proprietary rating figure provided to prospective borrowers based on the company’s estimation of that borrower’s “estimated loss rate”. According to the company, that figure is “determined by two scores: (1) the credit score, obtained from an official credit reporting agency, and (2) the Prosper Score, figured in-house based on the Prosper population.” Prosper Ratings, from lowest-risk to highest-risk, are labeled AA, A, B, C, D, E, and HR (“High Risk”)
-Wiki
Prosper Ratings allow potential investors to easily consider a loan application’s level of risk because the rating represents an estimated average annualized loss rate range to the investor.
There are many loans in rating category ‘C’. Surprisingly, the number of loans rated “High Risk” is greater than the number rated “AA”, the top category.
Prosper Score: A custom risk score was built using historical Prosper data to assess the risk of Prosper borrower listings. The output to Prosper users is a Prosper score which ranges from 1 to 11, with 11 being the best, or lowest risk, score. The worst, or highest risk, score, is a 1. Prosper uses both the custom score and the credit reporting agency score together to assess the borrower’s level of risk and determine estimated loss rates. The loss estimates are based on the historical performance of Prosper loans to borrowers with similar characteristics.
-Prosper
The Prosper score estimates the probability of a loan going “bad,” where “bad” is the probability of going 60+ days past due within the first twelve months from the date of loan origination.
Loans with a score of 4, 6 or 8 are the most numerous. In general, scores 4 to 8 account for more loans than the other scores.
There is a clear pattern in the above plot: as Prosper Score increases from 1 to 11, the number of higher-rated loans increases. There seems to be a positive relationship between Prosper Score and Prosper Rating. Let’s check this using a boxplot.
The evidence is clear: Prosper Rating and Prosper Score are positively correlated, with a Pearson correlation coefficient of 0.705. The details are below.
##
## Pearson's product-moment correlation
##
## data: df$ProsperScore and df$ProsperRating..numeric.
## t = 289.74, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7018231 0.7085876
## sample estimates:
## cor
## 0.7052214
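The output above has the shape produced by `cor.test()`. The sketch below runs the same kind of test on synthetic data, so the numbers will not match the report. It also illustrates that the reported degrees of freedom equal the number of complete pairs minus 2, which means df = 84851 corresponds to 84,853 loans with both a score and a rating.

```r
set.seed(1)

# Synthetic score (1 to 11) and a rating (1 to 7) that loosely follows it.
score  <- sample(1:11, 200, replace = TRUE)
rating <- pmin(7, pmax(1, round(score * 0.6 + rnorm(200))))

# Mimic loans that have no Prosper Score.
score[1:20] <- NA

# cor.test() uses only the pairs where both values are present.
res <- cor.test(score, rating, method = "pearson")
res$estimate   # Pearson's r
res$parameter  # degrees of freedom = complete pairs - 2
```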
Occupation: The occupation stated by the borrower at the time the listing was created.
As we saw earlier, there are 67 occupations in our dataset. The most common occupation specified by borrowers is ‘Other’. Let’s check the number of loans for the most common and the least common occupations. To do this, we will calculate the mean and median Prosper Score and the loan count for each occupation and store them in a separate dataframe. This will be used for our analysis.
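A base R sketch of such a per-occupation summary on toy data. The printed tibble suggests the report itself used a dplyr-style pipeline, so this is an equivalent approach rather than the original code. Dropping rows without a Prosper Score before counting is an assumption.

```r
# Toy data: occupation and Prosper Score (NA where no score exists).
df <- data.frame(
  Occupation   = c("Other", "Other", "Teacher", "Teacher", "Analyst", "Other"),
  ProsperScore = c(5, 7, 6, NA, 7, 6)
)

# Assumption: only loans with a score enter the summary.
scored <- df[!is.na(df$ProsperScore), ]

# One value per occupation for each statistic.
mean_score   <- tapply(scored$ProsperScore, scored$Occupation, mean)
median_score <- tapply(scored$ProsperScore, scored$Occupation, median)
n            <- tapply(scored$ProsperScore, scored$Occupation, length)

occ_summary <- data.frame(
  Occupation   = names(n),
  mean_score   = as.numeric(mean_score),
  median_score = as.numeric(median_score),
  n            = as.integer(n)
)

# Most common occupations first.
occ_summary <- occ_summary[order(-occ_summary$n), ]
print(occ_summary)
```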
## Structure of new dataframe:
## # A tibble: 6 x 4
## Occupation mean_score median_score n
## <fct> <dbl> <dbl> <int>
## 1 Other 5.76 6 21317
## 2 Professional 6.21 6 10542
## 3 Executive 6.62 7 3468
## 4 Computer Programmer 6.88 7 3236
## 5 Teacher 5.86 6 2888
## 6 Analyst 6.51 7 2735
Does Prosper Score vary by occupation? How do the top 10 and bottom 10 occupations compare on Prosper Score?
The green bar shows the median Prosper Score, while the black dot represents the mean Prosper Score for the same occupation. Surprisingly, students from technical school have the highest Prosper Score, whereas college students have one of the lowest.
Credit Score Range is provided as two columns in the dataset, a lower bound and an upper bound.
If we check both variables for unique values, we get this:
## Unique values in Lower Range:
## [1] 640 680 480 800 740 700 820 760 660 620 720 520 780 600 580 540 560
## [18] 500 840 860 NA 460 0 880 440 420 360
## Unique values in Upper Range:
## [1] 659 699 499 819 759 719 839 779 679 639 739 539 799 619 599 559 579
## [18] 519 859 879 NA 479 19 899 459 439 379
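The two listings pair up position by position, so the width of each range can be checked directly. The sketch below uses a handful of the pairs printed above; the vector names are ours.

```r
# A few lower/upper pairs taken from the unique values above.
lower <- c(640, 680, 480, 800, NA, 0)
upper <- c(659, 699, 499, 819, NA, 19)

# Width of each credit score range, ignoring missing pairs.
width <- upper - lower
unique(width[!is.na(width)])  # 19
```

Every pair in the listing differs by exactly 19 points, so the two columns carry the same information.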
So we will convert both variables to factors and plot the ranges.
From the plots, we can see that the lower and upper scores have the same distribution. The difference between the upper and lower range values is a constant 19 points (for example, 640 to 659). This is expected: if the lower bound of the credit score range is high, the upper bound is high as well. This is clearly evident from the plot below.
Clearly, as Prosper Rating improves, the credit score increases.
Since we know that Prosper Rating and Prosper Score are positively correlated, we can expect a result similar to the previous one. Let’s see.
Yes. As Prosper Score increases, the credit score also increases.
Bank Card Utilization: Credit card utilization - or just credit utilization, for short - refers to how much of your available credit you use at any given time. The credit utilization ratio is a component used by credit reporting agencies in calculating a borrower’s credit score. Lowering the credit utilization ratio can help a borrower to improve their credit score.
-Investopedia
Many borrowers have a high card utilization rate, meaning they are using their available credit close to its limit.
By definition, high card utilization has a negative influence on a borrower’s credit score and rating. We will plot the mean and median card utilization values for each Prosper Score. I would expect a negative relationship.
Indeed, the relationship is negative: as Prosper Score increases, the card utilization rate decreases. But we should also consider other hidden factors. The total available credit, which forms the base for the utilization ratio, may be smaller or larger, and this could influence the graph.
Available Credit: Your available credit is the amount you can use for purchases at any given time. It is the difference between your total credit limit and your account balance.
-Lendedu
## Summary on Available Credit:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 880 4100 11210 13180 646285 7544
## Available Credit at 99% percentile:
## 99%
## 89005.32
We can see that there is a huge gap between the maximum available credit (646,285) and the third quartile (13,180). 99% of the values lie below 89,005. We will use this cutoff to filter our data and plot the lower 99% of Available Credit values.
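A minimal sketch of this kind of percentile filter on a toy vector. The values are made up apart from echoing the quartiles above, and `credit`, `cutoff` and `trimmed` are illustrative names.

```r
# Toy Available Credit values with one extreme value and one NA.
credit <- c(0, 880, 4100, 13180, 646285, NA, 2500, 9000, 15000, 1200)

# 99th percentile, ignoring missing values.
cutoff <- quantile(credit, probs = 0.99, na.rm = TRUE)

# Keep only non-missing values at or below the cutoff.
trimmed <- credit[!is.na(credit) & credit <= cutoff]
summary(trimmed)
```

Filtering at a high percentile removes the long right tail for plotting without having to choose an arbitrary dollar limit.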
Most borrowers have relatively little available credit. From the summary we know that the median value is 4,100.
What is the relationship between Available Credit and Prosper Score?
As Prosper Score increases, Available Credit also increases. This is consistent with the lower utilization seen earlier: borrowers with a better Prosper Score tend to use a smaller share of their total credit limit, so their Available Credit is higher.
We will try plotting the relationship between Available Credit and Card Utilization Ratio along with Prosper Score. Let us see if we can find any relationship.
From the plot, we can see that the values are dispersed, and we cannot find a clear pattern between them. However, borrowers with a higher Prosper Score have more Available Credit than borrowers with a lower score, who in general have a lower Available Credit balance. The line shows the median Available Credit for each Card Utilization value.
Debt-To-Income: The debt-to-income (DTI) ratio is a personal finance measure that compares an individual’s debt payment to his or her overall income. The debt-to-income ratio is one way lenders, including mortgage lenders, measure an individual’s ability to manage monthly payment and repay debts. DTI is calculated by dividing total recurring monthly debt by gross monthly income, and it is expressed as a percentage. A low debt-to-income (DTI) ratio demonstrates a good balance between debt and income. Conversely, a high DTI can signal that an individual has too much debt for the amount of income he or she has.
-Investopedia
Summary of DTI in our dataset:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1400 0.2200 0.2759 0.3200 10.0100
We will plot the variable.
Few borrowers have a DTI greater than 1. Most borrowers have a DTI value below 1, and the middle 50% lie between 0.14 and 0.32. For our analysis we will concentrate on DTI values below 1.
We will study the relationship between DTI and Prosper Score. We will calculate the median DTI value for each score and plot them.
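A sketch of the median-per-score calculation using `aggregate()` on toy data. Whether the report filtered to DTI below 1 before taking medians is an assumption here, based on the stated focus on values below 1.

```r
# Toy data: Prosper Score and debt-to-income ratio, with gaps.
df <- data.frame(
  ProsperScore      = c(1, 1, 1, 2, 2, NA, 2),
  DebtToIncomeRatio = c(0.30, 0.36, 10.01, 0.20, 0.28, 0.15, NA)
)

# Assumption: restrict to DTI below 1 before summarising.
low_dti <- df[!is.na(df$DebtToIncomeRatio) & df$DebtToIncomeRatio < 1, ]

# The formula interface drops rows with a missing score by default.
median_dti <- aggregate(DebtToIncomeRatio ~ ProsperScore,
                        data = low_dti, FUN = median)
print(median_dti)
```

The median is a sensible choice here because a few extreme ratios, such as 10.01, would pull the mean upward.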
## Median DTI values for Prosper Scores
## # A tibble: 11 x 2
## ProsperScore median_dir
## <dbl> <dbl>
## 1 1 0.33
## 2 2 0.27
## 3 3 0.28
## 4 4 0.27
## 5 5 0.25
## 6 6 0.24
## 7 7 0.21
## 8 8 0.19
## 9 9 0.17
## 10 10 0.16
## 11 11 0.19
The median DTI value generally decreases as Prosper Score increases, from 0.33 at score 1 to 0.16 at score 10, with a slight uptick at score 3. Interestingly, for the highest score, 11, the median DTI (0.19) is greater than the median DTI of scores 9 and 10.
The spread of DTI also shrinks as Prosper Score increases. At lower scores the DTI values are more spread out in the graph. As the score increases, they become more concentrated.
Income Range: The income range of the borrower. The available income ranges are:
## Income Ranges in dataset:
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999 $75,000-99,999 Not displayed Not employed
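The levels above are in alphabetical order, which puts $100,000+ before $25,000-49,999. For plotting, an ordered factor with explicit levels is more useful. The sketch below shows one way to do this. Placing the two "Not ..." levels at the end is our assumption.

```r
# A few income range values as they appear in the data.
income <- c("$25,000-49,999", "$100,000+", "$0", "$1-24,999", "Not employed")

# Explicit level order from lowest to highest income.
income_levels <- c(
  "$0", "$1-24,999", "$25,000-49,999", "$50,000-74,999",
  "$75,000-99,999", "$100,000+", "Not employed", "Not displayed"
)

income_ord <- factor(income, levels = income_levels, ordered = TRUE)

levels(income_ord)
income_ord[1] < income_ord[2]  # TRUE: $25,000-49,999 ranks below $100,000+
```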
Let’s plot it out!
Most borrowers are in the income range of $25,000 to $74,999. Also, the number of borrowers in the $100,000+ category is slightly higher than in the $75,000-99,999 category.
We have already seen that most loans have a 3-year term, with 5-year loans in second place. The same is reflected here: 3-year loans dominate every income range. Looking at how they are distributed, the $1-24,999 category has very few 1-year loans in comparison to other income categories, and the $50,000-74,999 category has more 5-year loans than the others.
How is income range distributed across Prosper Score? Do higher Prosper Scores have more people from a particular income group? Let’s find out.
Yes. For lower scores, the lower income ranges account for more loans than the higher income ranges. But as the score increases, the higher income groups gradually take the lead.
How about Debt-To-Income ratio and Income Range? Do lower income range borrowers have higher DTI values?
The income range of $1 to $24,999 seems to have the most borrowers with a DTI value greater than 1. Borrowers who have not displayed their income range also include many with high DTI values. For higher income ranges, the DTI values are concentrated below 1.
Stated Monthly Income: The monthly income stated by the borrower. About 75% of borrowers have a monthly income of less than $7,000. We can also see that the maximum value is about $1.75 million, which seems hard to believe. It could be an outlier or an error. We will not go into it now.
## Summary of Stated Monthly Income
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750003
We will plot the variable.
99% of borrowers have a stated monthly income below $20,530. Most borrowers seem to have a monthly income between $3,000 and $5,000.
To relate Stated Monthly Income to Income Range, we will create a new dataframe with the mean, median and count of Stated Monthly Income grouped by Income Range.
## Structure of new dataframe:
## # A tibble: 6 x 4
## IncomeRange mean_income median_income n
## <ord> <dbl> <dbl> <int>
## 1 $0 0.000134 0 621
## 2 $1-24,999 1428. 1583. 7274
## 3 $25,000-49,999 3131. 3167. 32192
## 4 $50,000-74,999 5025. 5000 31050
## 5 $75,000-99,999 7057. 7000 16916
## 6 $100,000+ 12446. 10417. 17337
We will explore the relationship between income group and stated monthly income. There should be a high correlation between the variables, and the chart shows that. The red line shows the median stated income and the black line shows the mean stated income. Stated income increases with Income Range.
Listing Category: The category selected by the borrower for the listing, which specifies the reason for the loan. There are 21 category codes (0 to 20) in the dataset.
The most popular category is Debt Consolidation.
We have plotted the distribution of Listing Category by Income Range. The pattern is similar for all income ranges except $0. In the $0 income group, the most popular category is Business, followed by Personal Loan.
Most of the borrowers are Employed (working multiple jobs) or are full-time employees.
Do people with a certain employment status prefer a certain loan term?
Borrowers with Full-time employment mostly prefer a loan with a 3-year term. 5-year loans are preferred by borrowers in the Employed category.
A surprise here is that borrowers with employment status Employed have taken loans in many categories, but not Personal Loan or Student Use.
Borrower APR: Annual Percentage Rate (APR) is the cost of credit as a yearly rate. The APR figures in not just your interest rate, but also some fees associated with your loan over its lifetime. At Prosper, this means the closing fee charged when you first borrow the money. This closing fee is paid out of the loan proceeds when the loan originates.
-Prosper
The APR value of 0.36 has the highest count, which is surprising: it shows up as a sharp spike in the distribution.
We can see that the number of Charged-off loans is high at the Borrower APR value of 0.36. In fact, it is higher than at any other APR value. The number of current loans is high for APR values of 0.18 and 0.20.
What is the average APR for each income group?
The APR decreases as income range increases. But we notice that for the “Not employed” category the APR is high, higher than for any other group. The income range of $0, however, has a lower APR. We have previously seen that for the income range of $0 the most popular Listing Category was Business, followed by Personal Loan, unlike the other income ranges. This suggests that many borrowers in this income range are business owners, which may also explain the lower interest rate charged on their loans.
How is Borrower APR distributed across Prosper Scores?
From the plot, there is a clear pattern: as Prosper Score increases, Borrower APR decreases. We will also check this using a boxplot.
Yes, the relationship is clear. There is a fairly strong negative correlation (r = -0.67) between Prosper Score and Borrower APR. The correlation details are below.
##
## Pearson's product-moment correlation
##
## data: df$ProsperScore and df$BorrowerAPR
## t = -261.68, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6719940 -0.6645469
## sample estimates:
## cor
## -0.6682872
Now that we have seen the relationship, we will also add in the variable Term (in years).
There are a few things we can notice in this plot. The slope of the median Borrower APR against Prosper Score is different for each Term. The slope flattens as Term increases, meaning the median APR changes more across scores for 1-year loans than for 5-year loans. We also see that there are no 1-year loans with a Prosper Score of 11.
Borrower Rate: Borrower/Interest rate is the amount charged, expressed as a percentage of principal, by a lender to a borrower for the use of assets. The assets borrowed could include cash, consumer goods, and large assets such as a vehicle or building.
-Investopedia
The Interest rate refers to the annual cost of a loan to a borrower and is expressed as a percentage. The interest rate does not include fees charged for the loan. The APR (Annual Percentage Rate) reflects the annual cost of a loan to a borrower including any fees charged to originate the loan.
We can see that Borrower Rate is distributed much like Borrower APR. There is a spike at 0.32, similar to the spike at 0.36 in APR.
Borrower Rate increases with Term. This is surprising! Other factors may influence this, such as the loan amount, Prosper Score, Prosper Rating or loan category.
Interest rate varies by category. Loans for Cosmetic Procedures have the highest Borrower Rates, followed by Household Expenses loans.
There is a high correlation between Borrower Rate and Borrower APR. This is expected, since Borrower APR is the Borrower Rate plus fees.
We have plotted the relationship between Borrower Rate and Borrower APR for each loan Term. The black lines are the best-fit regression lines. The correlation results are below.
##
## Pearson's product-moment correlation
##
## data: df$BorrowerAPR and df$BorrowerRate
## t = 2347.7, df = 113910, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9897057 0.9899409
## sample estimates:
## cor
## 0.989824
Estimated Return: Estimated Return is the difference between the Estimated Effective Yield and the Estimated Loss Rate. Estimated Effective Yield is equal to the borrower interest rate: (i) minus the servicing fee rate, (ii) minus estimated uncollected interest on charge-offs, (iii) plus estimated collected late fees. The Estimated Loss Rate is the estimated principal loss on charge-offs, and is based on the historical performance of Prosper loans. All estimates are based on the historical performance of Prosper loans for borrowers with similar characteristics. The calculations of Estimated Return, Estimated Effective Yield, and Estimated Loss Rate require significant assumptions about the repayment of loans, and investors should make their own judgments with respect to the accuracy of these assumptions. Actual performance may differ from estimated performance.
-Prosper
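The definition above can be followed step by step in a small calculation. All input values below are made-up illustrations, not Prosper figures, and the function name `estimated_return` is our own.

```r
# Estimated Return = Estimated Effective Yield - Estimated Loss Rate,
# following the definition quoted above. All rates are annual fractions.
estimated_return <- function(borrower_rate, servicing_fee_rate,
                             uncollected_interest, collected_late_fees,
                             estimated_loss_rate) {
  effective_yield <- borrower_rate - servicing_fee_rate -
    uncollected_interest + collected_late_fees
  effective_yield - estimated_loss_rate
}

# Hypothetical low-risk loan: modest rate, small expected loss.
low_risk <- estimated_return(0.10, 0.01, 0.005, 0.001, 0.02)

# Hypothetical high-risk loan: high rate, but an even higher expected loss.
high_risk <- estimated_return(0.30, 0.01, 0.040, 0.005, 0.27)

low_risk   # 0.066
high_risk  # -0.015
```

The second case shows how an estimated return can be negative: the expected principal loss exceeds the effective yield even though the borrower rate is high.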
The estimated returns are negative in some cases. The typical return is estimated to be around 6-12%.
We have significant outliers for 3-year loans, and they extend into negative returns too. At first sight, the estimated return is lowest for 1-year loans, and it increases with Term.
This is an important comparison because Borrower Rate is the interest charged on the loan, which is ultimately paid by the borrower, while Estimated Return is the return on investment that the investor can expect from the loan. If there is a relationship, it could guide investors’ decisions. Let’s check it out.
In the graph we can notice clear hidden patterns: straight lines with different slopes on top of one another. What factor could give us clarity? Let’s begin with the loan duration, or Term.
Aha! It’s clear now that the duration of the loan plays a major role in this comparison. The slope is different in each case. As Borrower Rate increases, the estimated return rises most for 3-year loans. This may help explain why there are more loans in the 3-year category. There could be other reasons, but this is one of the influences.
Hmm, even now we can see that within each of the 3 loan Terms there are straight lines of varying slope layered upon one another. We need to dig deeper! We will split the relationship further using Prosper Rating.
Whew! The plot has a lot of information to process.
After looking at Interest Rate and Estimated Return, we should look into Investors.
The distribution is right-skewed, with most of the values to the left: the number of investors per loan is below 300 in most cases. We’ll look at the summary for this variable.
## Summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 44.00 80.48 115.00 1189.00
One can see that 75% of loans have fewer than 116 investors. Also, the most common case is a loan with only 1 investor at the time of listing. We will filter and analyze loans with more than 1 investor, in order to see the distribution more clearly.
The number of investors is higher for borrowers with a high Prosper Score. But for the score of 11, this pattern seems to be broken. This could be because there are fewer loans with a Prosper Score of 11, or because the loan amounts were smaller so that fewer investors were needed to fund them, or for various other reasons.
Most of the investors have invested in loans with an estimated return greater than 0. However, we can find investments in loans with negative estimated returns, which could lead to a loss. The number of investors peaks at estimated returns of 0.05-0.11, which shows that most investors are interested in returns of roughly 5-11%. Higher returns come with more risk, and hence the number of investors falls.
A pattern observed here is that the number of investors per loan grows from under 200 (rating HR) to over 1,000 (rating AA) as Prosper Rating improves.
Loan Original Amount: The original amount loaned to the borrower.
We can see irregular spikes in the plot. That’s because people generally request round amounts like 10,000 or 15,000 when they apply for a loan. We have spikes at 4,000, 10,000, 15,000, 20,000 and 25,000.
We would expect the number of investors to increase as Loan Amount increases, and that is what we see in the plot. But for Loan Amounts below 10,000 the distribution is not clear. This could be because of the varied amounts requested and the larger number of investors funding those loans. This can be seen in the second plot, which is zoomed in on amounts below 10,000.
From the plot, we can see that below 10,000 there are more loans with a Term of 3 years, and these loans also have more investors.
For “High Risk” loans, both the number of investors and the loan amount are low. As Prosper Rating improves, both Loan Amount and the number of investors increase.
In this plot, we are examining the relationship between Prosper Score and Prosper Rating. Prosper Score estimates the probability of a loan going “bad,” where “bad” means going 60+ days past due within the first twelve months from the date of loan origination. Prosper Score ranges from 1 (worst) to 11 (best).
Prosper Rating is an internal rating provided by Prosper which represents an estimated average annualized loss rate range. Ratings run from AA (best) to HR (worst).
From the plot, we can see that as Prosper Score increases, the number of higher-rated loans increases. For scores of 1 and 2, most loans are in the HR and E categories. For score 3, most loans are in the D category. For scores 4 and 5, ‘C’ category loans are the most common. From score 6 on, higher-rated loans (A, AA) steadily grow and dominate the graph. Although higher-rated loans are present at middling scores like 6, 7 and 8, there are comparatively few of them. The Pearson correlation between Prosper Score and Prosper Rating is r = 0.705.
Available Credit is the amount that the borrower can use on credit. It is the balance amount in card that the borrower has after he has made some purchases. It is the difference between Total Credit and Credit Balance. Bank Card Utilization is the percent of Total Credit the borrower has utilized.
Now, the dataset describes borrowers at a specific point in time. If a listing were published today, the results might or might not match what we see in our dataset. However, in our dataset of more than 113K records, we find that Bank Card Utilization goes down as Prosper Score increases. The utilization rate for score 11 seems to break the trend, but the margin is not large. At the same time, Available Bank Credit increases with score. This increase in available credit may be due to lower utilization, but it is unlikely to be the only factor. We saw in our analysis that higher Prosper Scores come with more borrowers from the higher income ranges. It is possible that these borrowers have higher credit limits, which in turn results in lower utilization and more Available Credit. This is just an assumption made to identify the underlying factors of influence.
I could not end this project without this correlation matrix. The matrix is a storehouse of information, all displayed in a single plot. The displayed correlations range from -1 to 1, and both extremes appear in the matrix.
Once we know the meaning of the variables, these extreme correlations are of little practical surprise, but it is still useful to confirm the relationships statistically. Other than the above relationships, there are a few more that I would like to explore.
Prosper Rating and Estimated Return have a correlation of -0.7. This is significant because it says that as the rating of a loan improves, the expected return for the investor goes down. I think we should consider one more relationship in order to fully grasp this finding. Prosper Rating and Borrower Rate have a correlation shown as -1 in the matrix, and so do Prosper Rating and Borrower APR (the matrix values appear to be rounded). What does this mean?
This means that loans with a good rating are charged less interest, and when less interest is charged, the return on investment is also lower. Furthermore, Estimated Return and Borrower Rate have a correlation of 0.8, meaning that returns rise with the interest rate charged. This supports the finding above.
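A correlation matrix like the one discussed here can be sketched with `cor()`. The data below is synthetic, so the values will differ from the report. The example also shows how a correlation just under 1 is displayed as 1 once rounded to one decimal place, which may be why the matrix shows exact values of 1 and -1.

```r
set.seed(42)
n <- 500

# Synthetic borrower rate, an APR that tracks it closely,
# and a rating from 7 (best) to 1 (worst) that falls as the rate rises.
rate   <- runif(n, 0.05, 0.35)
apr    <- rate + rnorm(n, mean = 0.02, sd = 0.003)
rating <- 8 - as.integer(cut(rate, breaks = 7))

# Mimic loans without a rating.
rating[1:50] <- NA

# Pairwise-complete correlations, so NA rows only affect their own pairs.
m <- cor(data.frame(rate, apr, rating), use = "pairwise.complete.obs")

round(m, 1)      # what a plot with one decimal place would display
m["rate", "apr"] # the unrounded value is slightly below 1
```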
The Prosper dataset contains over 113K observations with 81 variables. The first obstacles in analyzing the dataset were understanding the variables, the terminology and the general domain of peer-to-peer lending. Even after that, determining which variables to analyze, not drifting too far down any one path of investigation, and not pulling in new variables throughout the process was tough.
However, success was found in many areas. The general analysis revealed areas of interest, such as the positive correlation between borrower rate and estimated return, which raised many questions about investment, loan rating and the difficulty of separating cause from effect. Also, expected trends were confirmed and unexpected relationships were revealed.
Additions to this dataset would strengthen this analysis. Time series data for individual borrowers would be more useful in identifying patterns in usage, borrowing, repayment, defaulting and so on. I also found many ‘NA’ values in the current dataset. With fewer ‘NA’ values, this analysis would reveal even more about the dataset.